Lecture 08

library(rms)
library(ggplot2)
library(knitr)
library(dplyr)
library(survival)
library(tibble)
library(survminer)
library(flexsurv)
library(eha)

Parts of the introductory material are borrowed from Emily Zabor’s introduction to survival analysis in R.
In logistic regression, we were interested in studying how risk factors were associated with presence or absence of disease. Sometimes, though, we are interested in how a risk factor or treatment affects time to disease or some other event. Or we may have study dropout, and therefore subjects who we are not sure if they had disease or not. In these cases, logistic regression is not appropriate.
Survival analysis is used to analyze data in which the time until the event is of interest. The response is often referred to as a failure time, survival time, or event time.
Time from randomization until cardiovascular death after some treatment intervention
Time from diagnosis until tumor recurrence
Time from diagnosis until AIDS for HIV patients
Time from production until a machine part fails
Note that there needs to be a clearly defined starting and ending point
Censoring is present when we have some information about a subject’s event time, but we don’t know the exact event time. For the analysis methods we will discuss to be valid, the censoring mechanism must be independent of the survival mechanism.
There are generally three reasons why censoring might occur:

A subject does not experience the event before the study ends
A subject is lost to follow-up during the study period
A subject withdraws from the study

These are all examples of right-censoring.
Rich JT, Neely JG, Paniello RC, Voelker CCJ, Nussenbaum B, Wang EW. A practical guide to understanding Kaplan-Meier curves. Otolaryngology Head and Neck Surgery. 2010;143(3):331-336. doi:10.1016/j.otohns.2010.05.007.
To illustrate the impact of censoring, suppose we have the following data:
How would we compute the proportion who are event-free at 10 years?
Survival analysis techniques provide a way to appropriately account for censored patients in the analysis.
Regardless of the type of censoring, we must assume that it is non-informative about the event; that is, the censoring is caused by something other than the impending failure.
\[ S(t) = \textrm{Pr}(T > t) = 1 - F(t) \]
plot(function(t) exp(-2*t), 0, 2, ylab="Survival", xlab="Time (years)")

It is non-increasing
At time \(t = 0\), \(S(t) = 1\). In other words, the probability of surviving past time 0 is 1.
At time \(t = \infty\), \(S(t) = S(\infty) = 0\). As time goes to infinity, the survival curve goes to 0.
The hazard function, h(t), is the instantaneous rate at which events occur, given no previous events. \[ h(t) = \lim_{{\Delta t} \to 0} \frac{P(t \leq T < t + \Delta t | T \geq t)}{\Delta t} \]
\(T\) is a random variable denoting the time until the event (for example, failure or death) occurs. The numerator in the fraction is the conditional probability that the event occurs in the small time interval \([t, t + Δt]\) given that it has not occurred before time \(t\). The denominator \(\Delta t\) is the length of this time interval. As \(\Delta t\) goes to 0, this ratio gives the instantaneous rate of occurrence of the event at time \(t\), which is the hazard function \(h(t)\).
The cumulative hazard
\[ H(t) = \int_{0}^{t} h(u) du \]
Here, h(u) is the hazard function and u is the variable of integration. This integral represents the accumulated risk of the event (for example, failure or death) over the time interval \([0, t]\).
The relationship between the survival function \(S(t)\) and the cumulative hazard function \(H(t)\)
\[ S(t) = \exp(-H(t)) \]
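The relationship \(S(t) = \exp(-H(t))\) can be checked numerically. A minimal base-R sketch, using an illustrative hazard \(h(t) = 2t\) (not from the lecture data), whose cumulative hazard is \(H(t) = t^2\):

```r
# Verify S(t) = exp(-H(t)) for the illustrative hazard h(t) = 2t,
# whose closed-form cumulative hazard is H(t) = t^2.
t <- seq(0, 2, by = 0.001)
h <- 2 * t
# trapezoid-rule integral of h(u) from 0 to each t
H <- c(0, cumsum((h[-1] + h[-length(h)]) / 2 * diff(t)))
S <- exp(-H)
max(abs(S - exp(-t^2)))  # numerical and closed-form survival agree closely
```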
To analyze survival data, we need to know the observed time \(Y_i\) and the event indicator \(\delta_i\). For subject \(i\):
Observed time \(Y_i = \min(T_i, C_i)\) where \(T_i\) = event time and \(C_i\) = censoring time
Event indicator \(\delta_i\) = 1 if event observed (i.e. \(T_i \leq C_i\)), = 0 if censored (i.e. \(T_i > C_i\))
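A small simulated sketch (base R; the event and censoring times are invented for illustration) of how \(Y_i\) and \(\delta_i\) are constructed:

```r
# Construct observed time Y_i = min(T_i, C_i) and event indicator delta_i
# from simulated event and censoring times (illustrative only).
set.seed(1)
t_event <- rexp(10, rate = 0.2)        # true event times T_i
t_cens  <- rexp(10, rate = 0.1)        # censoring times C_i
Y     <- pmin(t_event, t_cens)         # observed time
delta <- as.numeric(t_event <= t_cens) # 1 = event observed, 0 = censored
data.frame(Y = round(Y, 1), delta)
```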
In theory the survival function is smooth; in practice we observe events on a discrete time scale.
The survival probability at a certain time, \(S(t)\), is a conditional probability of surviving beyond that time, given that an individual has survived just prior to that time. The survival probability can be estimated as the number of patients who are alive without loss to follow-up at that time, divided by the number of patients who were alive just prior to that time.
The Kaplan-Meier estimate of survival probability at a given time is the product of these conditional probabilities up until that given time
\[ \hat{S}(t) = \prod_{i: t_i \leq t} (1 - \frac{d_i}{n_i}) \]
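The product-limit formula can be computed by hand. A base-R sketch on toy data (the times and censoring pattern are invented for illustration):

```r
# Hand computation of the Kaplan-Meier product-limit estimate on toy data.
time   <- c(2, 3, 3, 5, 8)   # observed times
status <- c(1, 1, 0, 1, 0)   # 1 = event, 0 = censored
event_times <- sort(unique(time[status == 1]))
n_i <- sapply(event_times, function(t) sum(time >= t))               # at risk
d_i <- sapply(event_times, function(t) sum(time == t & status == 1)) # events
S_hat <- cumprod(1 - d_i / n_i)
data.frame(event_times, n_i, d_i, S_hat)  # steps: 0.8, 0.6, 0.3
```

The same estimates are produced by `survfit(Surv(time, status) ~ 1)` from the {survival} package.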
lung dataset

The lung dataset is available from the {survival} package. The data contain subjects with advanced lung cancer from the North Central Cancer Treatment Group. We will focus on the following variables throughout this tutorial:
Note that the status is coded in a non-standard way in this dataset. Typically you will see 1=event, 0=censored. Let’s recode it to avoid confusion:
lung <- lung %>%
  mutate(
    status = recode(status, `1` = 0, `2` = 1)
  )

Now we have:
Here are the first 6 observations:
head(lung[, c("time", "status", "sex")])

  time status sex
1 306 1 1
2 455 1 1
3 1010 0 1
4 210 1 1
5 883 1 1
6 1022 0 1
Note: the Surv() function in the {survival} package accepts by default TRUE/FALSE, where TRUE is event and FALSE is censored; 1/0 where 1 is event and 0 is censored; or 2/1 where 2 is event and 1 is censored. Please take care to ensure the event indicator is properly formatted.
The Kaplan-Meier method is the most common way to estimate survival times and probabilities. It is a non-parametric approach that results in a step function, where there is a step down each time an event occurs.
The Surv() function from the {survival} package creates a survival object for use as the response in a model formula. There will be one entry for each subject that is the survival time, which is followed by a + if the subject was censored. Let’s look at the first 10 observations:
Surv(lung$time, lung$status)[1:10]

 [1]  306   455  1010+  210   883  1022+  310   361   218   166
We see that subject 1 had an event at time 306 days, subject 2 had an event at time 455 days, subject 3 was censored at time 1010 days, etc.
The survfit() function creates survival curves using the Kaplan-Meier method based on a formula. Let’s generate the overall survival curve for the entire cohort, assign it to object s1, and look at the structure using str():
s1 <- survfit(Surv(time, status) ~ 1, data = lung)
str(s1)

List of 16
$ n : int 228
$ time : num [1:186] 5 11 12 13 15 26 30 31 53 54 ...
$ n.risk : num [1:186] 228 227 224 223 221 220 219 218 217 215 ...
$ n.event : num [1:186] 1 3 1 2 1 1 1 1 2 1 ...
$ n.censor : num [1:186] 0 0 0 0 0 0 0 0 0 0 ...
$ surv : num [1:186] 0.996 0.982 0.978 0.969 0.965 ...
$ std.err : num [1:186] 0.0044 0.00885 0.00992 0.01179 0.01263 ...
$ cumhaz : num [1:186] 0.00439 0.0176 0.02207 0.03103 0.03556 ...
$ std.chaz : num [1:186] 0.00439 0.0088 0.00987 0.01173 0.01257 ...
$ type : chr "right"
$ logse : logi TRUE
$ conf.int : num 0.95
$ conf.type: chr "log"
$ lower : num [1:186] 0.987 0.966 0.959 0.947 0.941 ...
$ upper : num [1:186] 1 1 0.997 0.992 0.989 ...
$ call : language survfit(formula = Surv(time, status) ~ 1, data = lung)
- attr(*, "class")= chr "survfit"
Some key components of this survfit object that will be used to create survival curves include:
time: the timepoints at which the curve has a step, i.e. at least one event occurred
surv: the estimate of survival at the corresponding time

summary(s1)

Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
5 228 1 0.9956 0.00438 0.9871 1.000
11 227 3 0.9825 0.00869 0.9656 1.000
12 224 1 0.9781 0.00970 0.9592 0.997
13 223 2 0.9693 0.01142 0.9472 0.992
15 221 1 0.9649 0.01219 0.9413 0.989
26 220 1 0.9605 0.01290 0.9356 0.986
30 219 1 0.9561 0.01356 0.9299 0.983
31 218 1 0.9518 0.01419 0.9243 0.980
53 217 2 0.9430 0.01536 0.9134 0.974
54 215 1 0.9386 0.01590 0.9079 0.970
59 214 1 0.9342 0.01642 0.9026 0.967
60 213 2 0.9254 0.01740 0.8920 0.960
61 211 1 0.9211 0.01786 0.8867 0.957
62 210 1 0.9167 0.01830 0.8815 0.953
65 209 2 0.9079 0.01915 0.8711 0.946
71 207 1 0.9035 0.01955 0.8660 0.943
79 206 1 0.8991 0.01995 0.8609 0.939
81 205 2 0.8904 0.02069 0.8507 0.932
88 203 2 0.8816 0.02140 0.8406 0.925
92 201 1 0.8772 0.02174 0.8356 0.921
93 199 1 0.8728 0.02207 0.8306 0.917
95 198 2 0.8640 0.02271 0.8206 0.910
105 196 1 0.8596 0.02302 0.8156 0.906
107 194 2 0.8507 0.02362 0.8056 0.898
110 192 1 0.8463 0.02391 0.8007 0.894
116 191 1 0.8418 0.02419 0.7957 0.891
118 190 1 0.8374 0.02446 0.7908 0.887
122 189 1 0.8330 0.02473 0.7859 0.883
131 188 1 0.8285 0.02500 0.7810 0.879
132 187 2 0.8197 0.02550 0.7712 0.871
135 185 1 0.8153 0.02575 0.7663 0.867
142 184 1 0.8108 0.02598 0.7615 0.863
144 183 1 0.8064 0.02622 0.7566 0.859
145 182 2 0.7975 0.02667 0.7469 0.852
147 180 1 0.7931 0.02688 0.7421 0.848
153 179 1 0.7887 0.02710 0.7373 0.844
156 178 2 0.7798 0.02751 0.7277 0.836
163 176 3 0.7665 0.02809 0.7134 0.824
166 173 2 0.7577 0.02845 0.7039 0.816
167 171 1 0.7532 0.02863 0.6991 0.811
170 170 1 0.7488 0.02880 0.6944 0.807
175 167 1 0.7443 0.02898 0.6896 0.803
176 165 1 0.7398 0.02915 0.6848 0.799
177 164 1 0.7353 0.02932 0.6800 0.795
179 162 2 0.7262 0.02965 0.6704 0.787
180 160 1 0.7217 0.02981 0.6655 0.783
181 159 2 0.7126 0.03012 0.6559 0.774
182 157 1 0.7081 0.03027 0.6511 0.770
183 156 1 0.7035 0.03041 0.6464 0.766
186 154 1 0.6989 0.03056 0.6416 0.761
189 152 1 0.6943 0.03070 0.6367 0.757
194 149 1 0.6897 0.03085 0.6318 0.753
197 147 1 0.6850 0.03099 0.6269 0.749
199 145 1 0.6803 0.03113 0.6219 0.744
201 144 2 0.6708 0.03141 0.6120 0.735
202 142 1 0.6661 0.03154 0.6071 0.731
207 139 1 0.6613 0.03168 0.6020 0.726
208 138 1 0.6565 0.03181 0.5970 0.722
210 137 1 0.6517 0.03194 0.5920 0.717
212 135 1 0.6469 0.03206 0.5870 0.713
218 134 1 0.6421 0.03218 0.5820 0.708
222 132 1 0.6372 0.03231 0.5769 0.704
223 130 1 0.6323 0.03243 0.5718 0.699
226 126 1 0.6273 0.03256 0.5666 0.694
229 125 1 0.6223 0.03268 0.5614 0.690
230 124 1 0.6172 0.03280 0.5562 0.685
239 121 2 0.6070 0.03304 0.5456 0.675
245 117 1 0.6019 0.03316 0.5402 0.670
246 116 1 0.5967 0.03328 0.5349 0.666
267 112 1 0.5913 0.03341 0.5294 0.661
268 111 1 0.5860 0.03353 0.5239 0.656
269 110 1 0.5807 0.03364 0.5184 0.651
270 108 1 0.5753 0.03376 0.5128 0.645
283 104 1 0.5698 0.03388 0.5071 0.640
284 103 1 0.5642 0.03400 0.5014 0.635
285 101 2 0.5531 0.03424 0.4899 0.624
286 99 1 0.5475 0.03434 0.4841 0.619
288 98 1 0.5419 0.03444 0.4784 0.614
291 97 1 0.5363 0.03454 0.4727 0.608
293 94 1 0.5306 0.03464 0.4669 0.603
301 91 1 0.5248 0.03475 0.4609 0.597
303 89 1 0.5189 0.03485 0.4549 0.592
305 87 1 0.5129 0.03496 0.4488 0.586
306 86 1 0.5070 0.03506 0.4427 0.581
310 85 2 0.4950 0.03523 0.4306 0.569
320 82 1 0.4890 0.03532 0.4244 0.563
329 81 1 0.4830 0.03539 0.4183 0.558
337 79 1 0.4768 0.03547 0.4121 0.552
340 78 1 0.4707 0.03554 0.4060 0.546
345 77 1 0.4646 0.03560 0.3998 0.540
348 76 1 0.4585 0.03565 0.3937 0.534
350 75 1 0.4524 0.03569 0.3876 0.528
351 74 1 0.4463 0.03573 0.3815 0.522
353 73 2 0.4340 0.03578 0.3693 0.510
361 70 1 0.4278 0.03581 0.3631 0.504
363 69 2 0.4154 0.03583 0.3508 0.492
364 67 1 0.4092 0.03582 0.3447 0.486
371 65 2 0.3966 0.03581 0.3323 0.473
387 60 1 0.3900 0.03582 0.3258 0.467
390 59 1 0.3834 0.03582 0.3193 0.460
394 58 1 0.3768 0.03580 0.3128 0.454
426 55 1 0.3700 0.03580 0.3060 0.447
428 54 1 0.3631 0.03579 0.2993 0.440
429 53 1 0.3563 0.03576 0.2926 0.434
433 52 1 0.3494 0.03573 0.2860 0.427
442 51 1 0.3426 0.03568 0.2793 0.420
444 50 1 0.3357 0.03561 0.2727 0.413
450 48 1 0.3287 0.03555 0.2659 0.406
455 47 1 0.3217 0.03548 0.2592 0.399
457 46 1 0.3147 0.03539 0.2525 0.392
460 44 1 0.3076 0.03530 0.2456 0.385
473 43 1 0.3004 0.03520 0.2388 0.378
477 42 1 0.2933 0.03508 0.2320 0.371
519 39 1 0.2857 0.03498 0.2248 0.363
520 38 1 0.2782 0.03485 0.2177 0.356
524 37 2 0.2632 0.03455 0.2035 0.340
533 34 1 0.2554 0.03439 0.1962 0.333
550 32 1 0.2475 0.03423 0.1887 0.325
558 30 1 0.2392 0.03407 0.1810 0.316
567 28 1 0.2307 0.03391 0.1729 0.308
574 27 1 0.2221 0.03371 0.1650 0.299
583 26 1 0.2136 0.03348 0.1571 0.290
613 24 1 0.2047 0.03325 0.1489 0.281
624 23 1 0.1958 0.03297 0.1407 0.272
641 22 1 0.1869 0.03265 0.1327 0.263
643 21 1 0.1780 0.03229 0.1247 0.254
654 20 1 0.1691 0.03188 0.1169 0.245
655 19 1 0.1602 0.03142 0.1091 0.235
687 18 1 0.1513 0.03090 0.1014 0.226
689 17 1 0.1424 0.03034 0.0938 0.216
705 16 1 0.1335 0.02972 0.0863 0.207
707 15 1 0.1246 0.02904 0.0789 0.197
728 14 1 0.1157 0.02830 0.0716 0.187
731 13 1 0.1068 0.02749 0.0645 0.177
735 12 1 0.0979 0.02660 0.0575 0.167
765 10 1 0.0881 0.02568 0.0498 0.156
791 9 1 0.0783 0.02462 0.0423 0.145
814 7 1 0.0671 0.02351 0.0338 0.133
883 4 1 0.0503 0.02285 0.0207 0.123
\[ \hat{S}(t) = \prod_{i: t_i \leq t} (1 - \frac{d_i}{n_i}) \]
\(S(0) = 1\)
\(S(5) = 1 \times (1-1/228) = 0.9956\)
\(S(11) = 1 \times (1-1/228) \times (1-3/227) = 0.9825\)
etc.
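The first few steps can be reproduced directly with `cumprod`; the event counts and risk-set sizes below are taken from the `summary(s1)` table above:

```r
# First two steps of the lung Kaplan-Meier estimate, by hand.
d <- c(1, 3)      # events at times 5 and 11
n <- c(228, 227)  # number at risk just before those times
S_first <- cumprod(1 - d / n)
S_first           # 0.9956 0.9825, matching the survfit output
```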
The number at risk decreases because of events or censored observations
Most early times are events, so the number at risk decreases because of events. The first censored observation is at 92 days; there, the risk set decreases by both the number of events and the number censored (find this time point in the table to verify yourself)
ggsurvplot(s1, data=lung, risk.table = TRUE)

One quantity often of interest in a survival analysis is the probability of surviving beyond a certain number of years, \(x\).
For example, to estimate the probability of surviving to \(1\) year, use summary with the times argument (Note: the time variable in the lung data is actually in days, so we need to use times = 365.25)
summary(survfit(Surv(time, status) ~ 1, data = lung), times = 365.25)

Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
365 65 121 0.409 0.0358 0.345 0.486
We find that the \(1\)-year probability of survival in this study is 41%.
The associated lower and upper bounds of the 95% confidence interval are also displayed.
The \(1\)-year survival probability is the point on the y-axis that corresponds to \(1\) year on the x-axis for the survival curve.
What happens if you use a “naive” estimate? Here “naive” means that the patients who were censored prior to 1-year are considered event-free and included in the denominator.
121 of the 228 patients in the lung data died by \(1\) year so the “naive” estimate is calculated as:
\[\Big(1 - \frac{121}{228}\Big) \times 100 = 47\%\] You get an incorrect estimate of the \(1\)-year probability of survival when you ignore the fact that 42 patients were censored before \(1\) year.
Recall the correct estimate of the \(1\)-year probability of survival, accounting for censoring using the Kaplan-Meier method, was 41%.
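The naive arithmetic is easy to reproduce (the counts come from the text above):

```r
# "Naive" 1-year estimate: censored-before-1-year patients treated as event-free.
naive <- 1 - 121/228
round(naive, 2)  # 0.47, versus the Kaplan-Meier estimate of 0.41
```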
Ignoring censoring leads to an overestimate of the overall survival probability. Imagine two studies, each with 228 subjects. There are 165 deaths in each study. Censoring is ignored in one (blue line), censoring is accounted for in the other (yellow line). The censored subjects only contribute information for a portion of the follow-up time, and then fall out of the risk set, thus pulling down the cumulative probability of survival. Ignoring censoring erroneously treats patients who are censored as part of the risk set for the entire follow-up period.
Another quantity often of interest in a survival analysis is the average survival time, which we quantify using the median. Survival times are not expected to be normally distributed so the mean is not an appropriate summary.
We can obtain the median survival directly from the survfit object:
survfit(Surv(time, status) ~ 1, data = lung)

Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
n events median 0.95LCL 0.95UCL
[1,] 228 165 310 285 363
We see the median survival time is 310 days. The lower and upper bounds of the 95% confidence interval are also displayed.
Median survival is the time corresponding to a survival probability of \(0.5\):
The most common test for comparing survival by groups is the log-rank test
At each event time, constructs a 2x2 table of observed events and number at risk for each group
Compare observed to expected number of events under the null hypothesis that the event rates are the same in each group
A Chi-squared test stratifying over all event times
A test of the entire survival curve
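The observed-minus-expected construction above can be written out by hand and checked against survdiff. A sketch on invented toy data, using the standard two-group log-rank formulas:

```r
# Hand computation of the log-rank statistic, compared to survival::survdiff.
library(survival)
time   <- c(1, 3, 5, 2, 4, 6)
status <- rep(1, 6)                 # all events, no censoring (toy data)
group  <- rep(c(1, 2), each = 3)
U <- 0; V <- 0
for (t in sort(unique(time[status == 1]))) {
  n  <- sum(time >= t)                            # total at risk at t
  n1 <- sum(time >= t & group == 1)               # at risk in group 1
  d  <- sum(time == t & status == 1)              # events at t
  d1 <- sum(time == t & status == 1 & group == 1)
  U  <- U + d1 - d * n1 / n                       # observed minus expected
  if (n > 1) V <- V + d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
}
chisq_hand <- U^2 / V
sd_fit <- survdiff(Surv(time, status) ~ group)
c(chisq_hand, sd_fit$chisq)                       # the two statistics agree
```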
We get the log-rank p-value using the survdiff function. For example, we can test whether there was a difference in survival time according to sex in the lung data:
survdiff(Surv(time, status) ~ sex, data = lung)

Call:
survdiff(formula = Surv(time, status) ~ sex, data = lung)
N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 138 112 91.6 4.55 10.3
sex=2 90 53 73.4 5.68 10.3
Chisq= 10.3 on 1 degrees of freedom, p= 0.001
fit.sex <- survfit(Surv(time, status) ~ sex, data=lung)
ggsurvplot(fit.sex, data=lung, pval=TRUE)

There was a significant difference in overall survival according to sex in the lung data, with a log-rank p-value of p = 0.001.
The preceding approach was very flexible in that it makes no parametric assumptions about the shape of the survival curve
This can work well for categorical predictors and/or descriptive purposes
In a clinical trial where we have randomized to treatment groups (and presumably have minimized confounding), the Kaplan-Meier method and log-rank test can be the primary analysis method
Extensions are needed if we want to consider continuous predictors and multivariable models
Would be nice to maintain flexibility and have a readily interpretable summary of the association
Statistical methods
Regression models we have studied to date are generally not valid for survival data
Because of right censoring, survival time cannot be analyzed as a continuous outcome (e.g. linear regression)
Because of unequal length of followup, survival (yes/no) cannot be analyzed using logistic regression
In the presence of censored data, the usual descriptive statistics are not appropriate
Sample mean, sample median, simple proportions, sample standard deviation should not be used
Proper descriptives should be based on the Kaplan Meier estimates
Specialized regression models are needed with censored data
(Cox) Proportional hazards model
Considers the instantaneous risk of failure at each time among those subjects who have not failed
The term “proportional hazards” assumes that the ratio of these instantaneous failure rates is constant in time between groups
Proportional hazards (Cox) regression treats the survival distribution within a group semi-parametrically
The baseline hazard is estimated without making any distributional assumptions
The hazard ratio is the parameter of interest and is used to compare groups
Looks at odds of choosing subjects relative to prevalence in the population
Can be derived as estimating the odds ratio of an event at each time that an event occurs
Proportional hazards model averages the odds ratio across all observed event times
If the odds ratio is constant over time between two groups, such an average results in a precise estimate of the hazard ratio
Borrowing information
Uses other groups to make estimates in groups with sparse data
Borrows information across predictor groups
Also borrows information over time
Modeling the log hazard over time as a function of covariates \(X\)
“Baseline hazard” is unspecified. Baseline hazard is similar to an intercept
| Model | \(\textrm{log}(\lambda(t | X_i)) = \textrm{log}(\lambda_{0}(t)) + \beta_1 \times X_i\) |
|---|---|
| \(X_i = 0\) | log hazard at \(t\) is \(\textrm{log}(\lambda_{0}(t))\) |
| \(X_i = x\) | log hazard at \(t\) is \(\textrm{log}(\lambda_{0}(t)) + \beta_1 \times x\) |
| \(X_i = x+1\) | log hazard at \(t\) is \(\textrm{log}(\lambda_{0}(t))+ \beta_1 \times x + \beta_1\) |
Model on the hazard scale is found by exponentiating parameters
| Model | \(\lambda(t | X_i) = \lambda_{0}(t) \times e^{\beta_1 \times X_i}\) |
|---|---|
| \(X_i = 0\) | hazard at \(t\) is \(\lambda_{0}(t)\) |
| \(X_i = x\) | hazard at \(t\) is \(\lambda_{0}(t) \times e^{\beta_1 \times x}\) |
| \(X_i = x+1\) | hazard at \(t\) is \(\lambda_{0}(t) \times e^{\beta_1 \times x} \times e^{\beta_1}\) |
Interpretation of the model
No intercept
Slope parameter
Hazard ratio found by exponentiating the slope from the PH regression: \(\textrm{exp}(\beta_1)\)
The hazard ratio compares groups differing in the value of the predictor by 1 unit
Relationship to survival
Hazard function determines the survival function
| Hazard | \(\lambda(t | X_i) = \lambda_{0}(t) \times e^{\beta_1 \times X_i}\) |
| Cumulative Hazard | \(\Lambda(t | X_i) = \displaystyle\int^t_0 \lambda_{0}(u) \times e^{\beta_1 \times X_i} \ du\) |
| Survival Function | \(S(t | X_i) = e^{-\Lambda(t|X_i)} = [S_0(t)]^{e^{\beta_1 \times X_i}}\) |
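The last identity, \(S(t | X_i) = [S_0(t)]^{e^{\beta_1 X_i}}\), can be verified numerically. A base-R sketch using an arbitrary illustrative baseline hazard and \(\beta_1 = 0.4\) (not fitted to any data):

```r
# Numeric check that S(t | X) = S0(t)^(exp(b * x)) under proportional hazards.
t <- seq(0, 5, by = 0.01)
lambda0 <- 0.2 + 0.05 * t           # arbitrary increasing baseline hazard
b <- 0.4; x <- 2
# trapezoid-rule baseline cumulative hazard
H0 <- c(0, cumsum((lambda0[-1] + lambda0[-length(lambda0)]) / 2 * diff(t)))
S0 <- exp(-H0)
S_direct <- exp(-H0 * exp(b * x))   # integrate lambda0 * exp(b*x) directly
S_power  <- S0^exp(b * x)           # [S0(t)]^{exp(b*x)}
max(abs(S_direct - S_power))        # identical up to rounding
```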
set.seed(55)
x <- seq(0, 2*pi, by=0.1)
lambda0 <- (2*sin(x) + 4 + rnorm(length(x), 0, .2)) / 100
lambda0[1] <- 0
plot(x, lambda0, type='l', xlab="Time", ylab="Hazard", ylim=c(0,.1), main="Baseline Hazard Over Time", lwd=2)
legend("top", inset=0.05, lty=1, expression(lambda[0] (t)), lwd=2)

plot(x, cumsum(lambda0), type='l', xlab="Time", ylab="Cumulative Hazard", ylim=c(0,3), main="Baseline Cumulative Hazard over Time", lwd=2)
legend("top", inset=0.05, lty=1, expression(Lambda[0] (t) == integral(lambda[0] (u)*du, 0, t)), lwd=2)

plot(x, exp(-cumsum(lambda0)), type='l', xlab="Time", ylab="Survival", ylim=c(0,1), main="Baseline Survival over Time", lwd=2)
legend("top", inset=0.05, lty=1, expression(exp(-Lambda[0] (t))), lwd=2)

lambda1 <- lambda0*1.5
plot(x, lambda0, type='l', xlab="Time", ylab="Hazard", ylim=c(0,.1), main="Proportional Hazards for Two Groups", lwd=2)
lines(x, lambda1, col="Red", lwd=2)
legend("topright", inset=0.05, lty=1, c(expression(lambda[0] (t)), expression(lambda[1] (t) == 1.5*lambda[0] (t))), lwd=2, col=c("Black","Red"))

plot(x, log(lambda0), type='l', xlab="Time", ylab="Log Hazard", ylim=c(-4.5,-2), main="Proportional Hazards for Two Groups, log scale", lwd=2)
lines(x, log(lambda1), col="Red", lwd=2)
legend("topright", inset=0.05, lty=1, c(expression(log(lambda[0] (t))), expression(log(lambda[1] (t)) == log(1.5) + log(lambda[0] (t)))), lwd=2, col=c("Black","Red"))

Comments on plots
Baseline hazard can follow any functional form. This is the “non-parametric” part of the Cox proportional hazards model
The cumulative hazard is a non-decreasing function that starts at 0 at time 0. It is unbounded above, growing toward \(\infty\).
Survival is a function of the cumulative hazard. Survival is 1 at time 0 and decreases over time. It is bounded by \(0\) and \(1\).
For a dichotomous predictor variable (two groups), the proportional hazards assumption is that \(\lambda_1 (t) = e^{\beta_1} \lambda_0 (t)\)
In the plots I illustrated \(e^{\beta_1} = 1.5\) (\(\beta_1 = 0.4054\)).
On the hazard scale, this corresponds to \(\lambda_1(t)\) always being \(1.5\) times larger than \(\lambda_0(t)\).
On the log hazard scale, this corresponds to \(log(\lambda_1(t))\) always being \(log(1.5) = 0.4054\) units larger than \(log(\lambda_0(t))\).
The proportional hazards assumption is the “parametric” part of the Cox proportional hazards model. Hence, the Cox proportional hazards model is referred to as being “semi-parametric”.
Stata commands
“stset time event-indicator”
“stcox predictor, [robust]”
R functions
“Surv(time, event-indicator)”
“coxph(Surv(time, event-indicator) ~ predictor)”
There is a robust option in coxph; it defaults to TRUE only when the model includes a cluster() term, and otherwise must be requested with robust = TRUE. In my opinion, robust standard errors are a good default to use because the assumption of proportional hazards in the Cox model rarely (never?) holds completely.
Prognostic values of nadir PSA relative to time in remission
PSA dataset: 50 men who received hormonal treatment for advanced prostate cancer
Followed at least 24 months for clinical progression, but exact followup varies from subject to subject
Nadir PSA: lowest level of serum prostate specific antigen achieved post treatment
Scatterplots of censored data are not scientifically meaningful
It is best to not generate them unless you do something to indicate the censored data points
We can label censored data, but have to remember that the true value may be anywhere larger than that
library(foreign)
library(rms)
psa <- read.dta(file="data/psa.dta")
psa$nadirpsa <- psa$dirpsa
ggplot(psa, aes(y=obstime, x=nadirpsa, color=inrem)) + geom_point() + theme_bw()

Characterization of plot
Outliers: Can’t tell
First order trends
Downward trending slope
No censoring at high nadir PSAs (high nadir PSA are all early events)
Second order trend: Must be curvilinear (but how much?)
Variability within groups: Highest variability within lowest PSA groups
# First, create an indicator variable for those who relapsed at any time point
psa$relapse <- as.numeric(psa$inrem=="no")
with(psa, table(relapse, inrem))

        inrem
relapse no yes
0 0 14
1 36 0
Create a relapse variable to indicate if a subject has relapsed or not
Define the outcome using “Surv(obstime, relapse)”
obstime: Time to event (either time to relapse or time to censored)
relapse: Failure indicator variable
m1 <- coxph(Surv(obstime, relapse) ~ nadirpsa, data=psa)
m1

Call:
coxph(formula = Surv(obstime, relapse) ~ nadirpsa, data = psa)
coef exp(coef) se(coef) z p
nadirpsa 0.01572 1.01584 0.00375 4.191 2.77e-05
Likelihood ratio test=11.78 on 1 df, p=0.0005989
n= 50, number of events= 36
summary(m1)

Call:
coxph(formula = Surv(obstime, relapse) ~ nadirpsa, data = psa)
n= 50, number of events= 36
coef exp(coef) se(coef) z Pr(>|z|)
nadirpsa 0.01572 1.01584 0.00375 4.191 2.77e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
nadirpsa 1.016 0.9844 1.008 1.023
Concordance= 0.757 (se = 0.037 )
Likelihood ratio test= 11.78 on 1 df, p=6e-04
Wald test = 17.57 on 1 df, p=3e-05
Score (logrank) test = 24.3 on 1 df, p=8e-07
exp(confint.default(m1))

            2.5 %   97.5 %
nadirpsa 1.008403 1.023334
# 10-unit change in nadir PSA
exp(coef(m1)*10)

nadirpsa
1.170192

exp(10*confint.default(m1))

            2.5 %   97.5 %
nadirpsa 1.087276 1.259431
By default, R gives the coefficient and the hazard ratio (exp(coef)) for comparing groups who differ by one unit in the nadir PSA covariate
\(\textrm{Hazard ratio} = 1.0158^{\Delta nadir}\)
The estimated hazard ratio for two groups differing by 1 in nadir PSA is found by exponentiating the slope
Groups 1 unit higher in nadir PSA have an instantaneous event rate \(1.0158\) fold higher (1.6% higher)
Groups 10 units higher in nadir PSA have an instantaneous event rate \(1.0158^{10} = 1.17\) fold higher (17% higher)
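The 10-unit hazard ratio follows directly from the fitted slope (the coefficient 0.01572 is taken from the m1 output above):

```r
# Hazard ratio for a 10-unit difference in nadir PSA, from the m1 slope.
b <- 0.01572
exp(10 * b)  # about 1.17: a 17% higher instantaneous event rate
```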
Robust standard errors are available in coxph via the robust argument (they are the default only when the model includes a cluster() term). Note that other functions for fitting Cox models in R may or may not behave the same way.
“From proportional hazards regression analysis, we estimate that for each 1 ng/ml unit difference in nadir PSA, the risk of relapse is \(1.6\%\) higher in the group with the higher nadir PSA. This estimate is highly statistically significant (\(p < 0.001\)). A 95% CI suggests that this observation would not be unusual if a group with a 1 ng/ml higher nadir PSA had a risk of relapse anywhere from \(0.8\%\) higher to \(2.3\%\) higher than the group with the lower nadir PSA.”
The ideas of Signal and Noise found in simple linear regression do not translate well to PH regression
Valid statistical inference (CIs, p-values) about associations requires three general assumptions
Assumptions about approximately Normal distributions for the parameter estimates
Large N
Need for either robust standard errors or classical regression
Definition of large depends on the underlying probability distribution
Assumptions about independence of observations
Classical regression: Independence of all observations
Robust standard errors: Correlated observations within identified clusters
Assumptions about variance of observations within groups
Classical regression: Mean-variance relationship for binary data
Proportional hazards considers the hazard of event at every time
Hence in order to satisfy this requirement, need proportional hazards and linearity of predictor
Robust standard errors
Allows unequal variance across groups
Hence, do not need linearity of predictor or proportional hazards
Valid statistical inference (CIs, p-values) about hazard of response in specific groups requires a further assumption
Assumption about the adequacy of the linear model
If we are trying to borrow information about the log hazards from neighboring groups, and we are assuming a straight line relationship, the straight line needs to be true
Needed for either classical or robust standard errors
Note that we can model transformations of the measured predictor
We rarely make inference about within group survival probabilities using the proportional hazards model
Sometimes estimated survival curves are used descriptively
Use estimates of the baseline survival function
Exponentiate the baseline survival to find survival curve for specific covariate patterns
Relationship to survival
Hazard function determines the survival function.
For the Cox model
| Hazard | \(\lambda(t | X_i) = \lambda_{0}(t) \times e^{\beta_1 \times X_i}\) |
| Cumulative Hazard | \(\Lambda(t | X_i) = \displaystyle\int^t_0 \lambda_{0}(u) \times e^{\beta_1 \times X_i} \ du\) |
| Survival Function | \(S(t | X_i) = e^{-\Lambda(t|X_i)} = [S_0(t)]^{e^{\beta_1 \times X_i}}\) |
Suppose that you know based on prior experience or consultation with collaborators…
A constant difference in PSA would not be expected to confer the same increased risk
A multiplicative effect on risk might be better
Same increase in risk for each doubling of nadir PSA
Achieve this model by using log transformed nadir PSA
psa$lognadirpsa <- log(psa$nadirpsa)
m2 <- coxph(Surv(obstime, relapse) ~ lognadirpsa, data=psa)
m2

Call:
coxph(formula = Surv(obstime, relapse) ~ lognadirpsa, data = psa)
coef exp(coef) se(coef) z p
lognadirpsa 0.4372 1.5484 0.0877 4.986 6.17e-07
Likelihood ratio test=24.01 on 1 df, p=9.601e-07
n= 50, number of events= 36
summary(m2)

Call:
coxph(formula = Surv(obstime, relapse) ~ lognadirpsa, data = psa)
n= 50, number of events= 36
coef exp(coef) se(coef) z Pr(>|z|)
lognadirpsa 0.4372 1.5484 0.0877 4.986 6.17e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
lognadirpsa 1.548 0.6458 1.304 1.839
Concordance= 0.757 (se = 0.037 )
Likelihood ratio test= 24.01 on 1 df, p=1e-06
Wald test = 24.86 on 1 df, p=6e-07
Score (logrank) test = 29.02 on 1 df, p=7e-08
exp(confint.default(m2))

               2.5 %   97.5 %
lognadirpsa 1.303896 1.838834
Hazard ratio is \(1.55\) for a \(e\)-fold difference in nadir PSA (\(e = 2.7183\))
It is easier to understand doubling, tripling, 5-fold, or 10-fold increases
For doubling: \(\textrm{HR} = 1.54^{log(2)} = 1.35\)
For 5-fold: \(\textrm{HR} = 1.54^{log(5)} = 1.99\)
Can similarly transform the upper and lower limits of the confidence interval
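A sketch of these transformations, using the coefficient (0.4372) and confidence limits from the m2 output above:

```r
# Hazard ratios for doubling and 5-fold differences in nadir PSA, from the
# m2 log-scale coefficient; CI limits are mapped back to the coefficient scale.
b  <- 0.4372
ci <- log(c(lower = 1.303896, upper = 1.838834))
exp(b * log(2))   # HR per doubling, about 1.35
exp(b * log(5))   # HR per 5-fold difference, about 2.02
exp(ci * log(2))  # 95% CI for the doubling HR
```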
The confidence interval and statistical test given in the output is called a Wald test. Other tests (Score, Likelihood Ratio) are also possible.
All tests are asymptotically equivalent
The Wald test is easiest to obtain, but generally performs the poorest in small sample sizes. It is based on the coefficient estimate and standard error.
The Likelihood Ratio test performs the best in small samples. We will discuss it later, including how to obtain the test using post-estimation commands.
The Score test is not bad in small samples, but is often hard to obtain from software. It is exactly equal to the logrank test for binary outcomes and categorical predictors.
| Parameter | Approach |
|---|---|
| Means | Linear regression |
Interpretation of non-transformed slope: \(\theta_X = \beta_0 + \beta_1 \times X\)
\(\beta_1\) : (Average) difference in summary measure between groups per 1 unit difference in \(X\)
\(\Delta \times \beta_1\) : (Average) difference in summary measure between groups per \(\Delta\) unit difference in \(X\)
Interpretation with log transformed slope: \(\theta_X = \beta_0 + \beta_1 \times \textrm{log}(X)\)
| Parameter | Approach |
|---|---|
| Geometric means | Linear regression on log scale |
| Odds | Logistic regression |
| Rates | Poisson regression |
| Hazards | Proportional Hazards (Cox) regression |
Interpretation of non-transformed slope: \(\textrm{log}(\theta_X) = \beta_0 + \beta_1 \times X\)
\(e^{\beta_1}\) : (Average) ratio of summary measure between groups per 1 unit difference in \(X\)
\(e^{\Delta \times \beta_1}\) : (Average) ratio of summary measure between groups per \(\Delta\) unit difference in \(X\)
Interpretation with log transformed slope: \(\textrm{log}(\theta_X) = \beta_0 + \beta_1 \times \textrm{log}(X)\)
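When the outcome model is on the log scale and the predictor is also log transformed, the coefficient converts to a ratio per \(\Delta\)-fold change in \(X\):

```latex
\frac{\theta_{\Delta X}}{\theta_{X}} = e^{\beta_1 \textrm{log}(\Delta)} = \Delta^{\beta_1}
```

For example, a doubling of \(X\) multiplies the summary measure by \(2^{\beta_1}\).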
Fully parametric models are less common than the semi-parametric Cox model because they are less flexible
However, if they fit a particular dataset well, they may
Estimate fewer parameters
Allow for better predictions (as long as the error distribution is correctly specified!)
Allow for comparison of survival estimates rather than hazards (accelerated failure time)
Two fundamental models used to describe the way that some factor might affect time to event
Proportional hazards (aka Cox model)
Accelerated failure time (less popular)
Accelerated Failure Time model
Comparisons across groups based on survival times rather than hazards
Assume that a factor causes some subjects to move through their lifetime faster
Basic idea: for every year in a reference group’s life, the other group ages “k” years
e.g. 1 human year is about 7 dog years, so a 70-year-old human is the same “age” as a 10-year-old dog
In AFT terminology, the probability that a human survives beyond 70 years is the same as the probability that a dog survives beyond 10 years
In this framework, dogs accelerate through life 7 times faster than humans. Equivalently, the lifespan of a human is stretched out to be 7 times longer than that of a dog.
Assumes that ratios of quantiles of the survival distribution are constant across groups
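The dog-year example can be written in AFT notation, with acceleration factor \(k = 7\):

```latex
S_{\textrm{dog}}(t) = S_{\textrm{human}}(k t), \qquad k = 7
\quad\Rightarrow\quad
S_{\textrm{dog}}(10) = S_{\textrm{human}}(70),
\qquad
t_{\textrm{human}}(q) = k \, t_{\textrm{dog}}(q) \ \textrm{ for every quantile } q
```

The last identity is the constant-quantile-ratio assumption stated above.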
AFT models include the parametric exponential, Weibull, and lognormal models
The Weibull model can be interpreted as either a proportional hazards or an accelerated failure time model; it is one of the few models that satisfies both assumptions.
The Weibull model accommodates monotonically increasing or decreasing hazards
The following shows the relationship among the cumulative distribution function (CDF), density function (PDF), survival function, and hazard function for four different shape parameters of a Weibull distribution, with the scale parameter held constant at 2
plot.data <- expand.grid(
t = seq(0.1, 10, .1),
loc = c(.5, 1, 1.5, 2)
) %>%
mutate(
`F (CDF)` = pweibull(t, shape = loc, scale = 2),
`f (PDF)` = dweibull(t, shape = loc, scale = 2),
S = 1 - `F (CDF)`,
`h = f / S` = flexsurv::hweibull(t, shape = loc, scale = 2),
loc = paste("shape:", loc)
)
plot.data <- expand.grid(t=seq(0.1, 10, .1),
loc=c(.5, 1, 1.5, 2))
plot.data2 <- with(plot.data,
data.frame(t=rep(t,4),
loc=rep(loc,4),
value=NA,
type=rep(1:4,each=nrow(plot.data))))
plot.data2$value[plot.data2$type==1] <-
pweibull(plot.data2$t[plot.data2$type==1], shape = plot.data2$loc[plot.data2$type==1], scale = 2)
plot.data2$value[plot.data2$type==2] <-
dweibull(plot.data2$t[plot.data2$type==2], shape = plot.data2$loc[plot.data2$type==2], scale = 2)
plot.data2$value[plot.data2$type==3] <-
1-pweibull(plot.data2$t[plot.data2$type==3], shape = plot.data2$loc[plot.data2$type==3], scale = 2)
plot.data2$value[plot.data2$type==4] <-
flexsurv::hweibull(plot.data2$t[plot.data2$type==4], shape = plot.data2$loc[plot.data2$type==4], scale = 2)
plot.data2$type <- factor(plot.data2$type, levels=1:4, labels=c("F (CDF)","f (PDF)", "S = 1 - F", "h = f / S" ))
plot.data2$loc2 <- paste("shape =", plot.data2$loc)  # scale is fixed at 2; loc holds the shape
ggplot(plot.data2, aes(x=t, y=value)) + geom_line() + facet_grid(type ~ loc2, scales="free_y") + theme_bw()

\[h(t) = \lambda p t^{p-1}\]
\(p\) is a shape parameter
\(p>1\) is an increasing hazard over time
\(p=1\) is a constant hazard
\(p<1\) is a decreasing hazard
We can model the scale as a function of covariates
\[\textrm{where } \lambda = \textrm{exp}(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p)\]
A special case of the Weibull model is the Exponential model. The Weibull reduces to the Exponential when the shape parameter is 1 (\(p=1\)). A shape parameter of 1 assumes the hazard is constant over time, and the cumulative hazard increases linearly with time.
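Setting \(p = 1\) in the Weibull hazard makes this explicit:

```latex
h(t) = \lambda p t^{p-1} \Big|_{p=1} = \lambda,
\qquad
H(t) = \int_0^t h(u)\,du = \lambda t,
\qquad
S(t) = e^{-\lambda t}
```

so the hazard is constant, the cumulative hazard is linear in \(t\), and survival decays exponentially.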
Nelson-Aalen estimator of the cumulative hazard (non-parametric)
\[\hat{H}(t)=\sum_{i:t_i\leq t}\frac{d_i}{n_i}\]
where \(d_i\) is the number of events at time \(t_i\), and \(n_i\) is the number of individuals at risk just before time \(t_i\)
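As a small hand check of the estimator (sketched in plain Python with made-up data, since only the arithmetic matters, not any particular library): with events at \(t=2\) (1 event, 5 at risk) and \(t=4\) (2 events, 3 at risk), \(\hat{H}(4) = 1/5 + 2/3 \approx 0.87\).

```python
def nelson_aalen(times, events):
    """Nelson-Aalen estimate of H(t) at each distinct event time.

    times: follow-up times; events: 1 = event, 0 = censored.
    Returns a list of (t_i, H(t_i)) pairs at the event times t_i.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    n_at_risk = len(times)
    H, out, i = 0.0, [], 0
    while i < len(order):
        t = times[order[i]]
        # everyone with this follow-up time leaves the risk set together
        tied = [j for j in order[i:] if times[j] == t]
        d = sum(events[j] for j in tied)
        if d > 0:
            H += d / n_at_risk   # increment d_i / n_i
            out.append((t, H))
        n_at_risk -= len(tied)
        i += len(tied)
    return out

# Toy data: events at t=2 (n=5, d=1) and t=4 (n=3, d=2); censored at 3 and 6
est = nelson_aalen([2, 3, 4, 4, 6], [1, 0, 1, 1, 0])
```
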
Apply this to the lung cancer dataset
# Calculate the non-parametric estimate of the cumulative hazard
n.risk <- summary(s1)$n.risk
n.evt <- summary(s1)$n.event
time <- summary(s1)$time
# Hazard increment at each event time is d_i / n_i
na.haz <- n.evt/n.risk
na.cumhaz <- cumsum(na.haz)
# Plot cumulative hazard versus time
plot(time, na.cumhaz)

fit2 <- survreg(Surv(time, status) ~ sex, data = lung, dist = "weibull")
pct <- 1:98/100
ptime1 <- predict(fit2, newdata=data.frame(sex=1), type='quantile',
p=pct)
ptime2 <- predict(fit2, newdata=data.frame(sex=2), type='quantile',
p=pct)
# Kaplan-Meier fit
plot(fit.sex, col=1:2, ylab="Survival")
lines(ptime1, 1-pct, col=1,lty=2)
lines(ptime2, 1-pct, col=2,lty=2)
legend("topright", col=c(1,1,2,2), lty=c(1,2,1,2),
c("Kaplan-Meier, Male","Weibull, Male","Kaplan-Meier, Female","Weibull, Female"))

The Weibull model is both a proportional hazards and an accelerated failure time model
How the model is parameterized gives either a proportional hazards or an accelerated failure time interpretation
For the Weibull proportional hazards model with one covariate (\(X\))
\[ h(t) = \lambda p t^{p-1} \]
\[ \lambda = \textrm{exp}(\beta_1 X) \]
For the Weibull accelerated failure time model with one covariate (\(X\))
\[ S(t) = \textrm{exp}(-\lambda t^p) \]
Solve for \(t\)
\[ t = (-\textrm{ln} S(t))^{1/p} \times \frac{1}{\lambda^{1/p}} \]
\[ \frac{1}{\lambda^{1/p}} = \textrm{exp}(\alpha_0 + \alpha_1 X) \]
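Equating the two parameterizations shows how the coefficient sets relate. Substituting \(\lambda = \textrm{exp}(\beta_0 + \beta_1 X)\) into \(1/\lambda^{1/p}\) gives

```latex
\frac{1}{\lambda^{1/p}} = \textrm{exp}\left(-\frac{\beta_0 + \beta_1 X}{p}\right)
\quad\Rightarrow\quad
\alpha_j = -\frac{\beta_j}{p}
```

so the AFT coefficients are the proportional hazards coefficients divided by \(-p\): exponentiating \(\alpha_1\) gives the acceleration factor (time ratio), and exponentiating \(\beta_1 = -p\,\alpha_1\) gives the hazard ratio.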
R's survreg() fits the accelerated failure time parameterization, so exponentiating a slope gives a ratio of survival times (an acceleration factor), not a hazard ratio. For the Weibull, the proportional hazards coefficient can be recovered by dividing the survreg coefficient by the scale parameter (with a sign change).
Here \(\textrm{exp}(0.3956) = 1.49\): survival times for females (sex = 2) are about 1.49 times those of males, i.e., male survival times are about 0.67 times female survival times.
summary(fit2)
Call:
survreg(formula = Surv(time, status) ~ sex, data = lung, dist = "weibull")
Value Std. Error z p
(Intercept) 5.4886 0.1790 30.66 < 2e-16
sex 0.3956 0.1276 3.10 0.0019
Log(scale) -0.2809 0.0619 -4.54 5.7e-06
Scale= 0.755
Weibull distribution
Loglik(model)= -1148.7 Loglik(intercept only)= -1153.9
Chisq= 10.4 on 1 degrees of freedom, p= 0.0013
Number of Newton-Raphson Iterations: 5
n= 228
exp(coef(fit2))
(Intercept)         sex
241.914353 1.485242
exp(confint.default(fit2))
                 2.5 %     97.5 %
(Intercept) 170.322286 343.598925
sex 1.156491 1.907447
m.aft <- aftreg(formula = Surv(time, status) ~ sex, data = lung,
dist = "weibull")
m.aft

Call:
aftreg(formula = Surv(time, status) ~ sex, data = lung, dist = "weibull")
Covariate W.mean Coef Time-Accn se(Coef) Wald p
sex 1.438 -0.395 0.673 0.128 0.002
Baseline parameters:
log(scale) 5.489 0.179 0.000
log(shape) 0.281 0.062 0.000
Baseline life expectancy: 223
Events 165
Total time at risk 69593
Max. log. likelihood -1148.7
LR test statistic 10.4
Degrees of freedom 1
Overall p-value 0.00126068
exp(confint(m.aft))
                 2.5 %      97.5 %
sex 0.524437 0.8648488
log(scale) 170.416568 343.7261363
log(shape) 1.173370 1.4956268
Acceleration factor (males compared to females) is 0.673
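The survreg and aftreg fits agree: aftreg reports the negative of the survreg coefficient, so the exponentiated estimates are reciprocals of one another. A quick arithmetic check (plain Python, using the estimates printed above):

```python
import math

b = 0.3956  # survreg coefficient for sex; aftreg reports the same value with opposite sign

time_ratio = math.exp(b)    # survreg's exp(coef): ratio of female-to-male survival quantiles, ~1.485
accel = math.exp(-b)        # aftreg's exponentiated coefficient ("Time-Accn"), ~0.673

# The two parameterizations are reciprocal
recip_check = time_ratio * accel   # ~1
```
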
# 25th, 50th, and 75th percentiles of survival time in males
ptime1[pct %in% c(0.25, 0.5, 0.75)]
[1] 140.2460 272.4383 459.8036
# 25th, 50th, and 75th percentiles of survival time in females
ptime2[pct %in% c(0.25, 0.5, 0.75)]
[1] 208.2993 404.6370 682.9198
# Ratio is constant (accelerated failure-time assumption)
ptime1[pct %in% c(0.25, 0.5, 0.75)] / ptime2[pct %in% c(0.25, 0.5, 0.75)]
[1] 0.6732908 0.6732908 0.6732908
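This constancy is a property of any Weibull AFT model: quantiles scale by the same factor at every probability level. A minimal sketch (plain Python, with arbitrary illustrative shape and scale values, not the fitted lung estimates):

```python
import math

def weibull_quantile(p, shape, scale):
    """Time by which a fraction p has failed, from S(t) = exp(-(t/scale)^shape)."""
    return scale * (-math.log(1 - p)) ** (1 / shape)

# Two groups sharing a shape parameter but with different scales (illustrative values)
shape, scale_a, scale_b = 1.3, 200.0, 300.0

ratios = [weibull_quantile(p, shape, scale_a) / weibull_quantile(p, shape, scale_b)
          for p in (0.25, 0.5, 0.75)]
# Every ratio equals scale_a / scale_b regardless of the quantile chosen
```
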
From the Weibull regression model, we can estimate both the hazard ratio and acceleration factor.
Comparing males to females: the estimated ratio of male to female survival quantiles is 0.673; equivalently, female survival times are \(\textrm{exp}(0.3956) = 1.49\) times those of males (95% CI 1.16 to 1.91). Since the interval excludes 1, the sex difference is statistically significant. To express this as a hazard ratio, divide the survreg coefficient by the scale: males have \(\textrm{exp}(0.3956/0.755) \approx 1.69\) times the hazard of death of females.
Comparing males to females, males have a median (or any other quantile) survival time that is 0.673 times the corresponding quantile in females. We are 95% confident that the true acceleration factor is between 0.52 and 0.86.
Statements are more awkward when the acceleration factor is less than 1. We can flip the estimate (e.g. \(1/0.673 \approx 1.49\)) and instead say that the median (or other quantile) survival time in females is 1.49 times that in males.
The example shown here focused on the Weibull model as an introduction. We saw that the Exponential distribution (a special case of the Weibull) was not a good fit.
Other (parametric) distributions are possible. Some have only a proportional hazards interpretation, or only an accelerated failure time interpretation. Others lack either interpretation but are more flexible in how they model the hazard function. There is a trade-off between interpretability and flexibility.
The goal is to pick a distribution whose hazard (cumulative hazard) is an appropriate model for your data, will be reproducible, and is ultimately interpretable.
The Cox model is a good choice if you want to model associations without concerning yourself with the shape of the hazard function. This is a very common situation and is why the Cox model is so widely used.
If goal is prediction, appropriate parametric models (perhaps with additional flexibility) may outperform the Cox model
Bayesian approaches to fitting survival models are most often based on parametric models
It is relatively straightforward to fit Bayesian models using parametric distributions like the Weibull
Models that attempt to recreate a Cox-like model are more complex because of the flexibility of the baseline hazard function